In the present assignment students are expected to choose one pair of compounds from a suggested list of moieties, representing a start and an end compound, and to trace the metabolic changes underlying their interconversion. The main aim of the assignment is stated as follows: learning ‘how to reconstruct a metabolic pathway using comparative genomics techniques: (1) gene neighborhood analysis; (2) domain fusion analysis; (3) phyletic gene pattern’. The respective pathways should then be traced in two bacterial families: Enterobacteriaceae (including the model Gram-negative species Escherichia coli) and Bacillaceae (comprising Bacillus subtilis, a model Gram-positive bacterium). For my personal assignment I chose the pair consisting of pyruvate as the starting molecule and L-valine as the end product, because tracing the conversion in this pair seemed to be quite a trivial task.
The selected compounds were searched for in the KEGG Compound database (http://www.genome.jp/kegg/compound/), with the following search output:
## Trivial name Compound ID Empirical formula
## 1 Pyruvate C00022 C3H4O3
## 2 L-valine C00183 C5H11NO2
With the identifiers obtained, I searched KEGG Pathway for pathways involving both compounds. A search for these identifiers in the general (map) database identified the following pathways:
## Entry Name
## 1 map01060 Biosynthesis of plant secondary metabolites
## 2 map05230 Central carbon metabolism in cancer
## 3 map01063 Biosynthesis of alkaloids derived from shikimate pathway
## 4 map01100 Metabolic pathways
## 5 map00290 Valine, leucine and isoleucine biosynthesis
## 6 map00770 Pantothenate and CoA biosynthesis
## 7 map01110 Biosynthesis of secondary metabolites
## 8 map01210 2-Oxocarboxylic acid metabolism
## 9 map01230 Biosynthesis of amino acids
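The lookup above amounts to intersecting the sets of pathways linked to each compound. A minimal sketch, assuming the per-compound listings were saved as two-column, tab-separated text (compound ID, pathway ID); the function names and the shortened sample listings are illustrative and are not the queries actually run for this report:

```python
def pathways_for(listing):
    """Collect pathway IDs from 'compound<TAB>pathway' lines."""
    pathways = set()
    for line in listing.strip().splitlines():
        _compound, pathway = line.split("\t")
        # Drop an optional 'path:' style prefix so IDs compare cleanly.
        pathways.add(pathway.split(":")[-1])
    return pathways


def shared_pathways(listing_a, listing_b):
    """Return the sorted pathway IDs present in both listings."""
    return sorted(pathways_for(listing_a) & pathways_for(listing_b))


# Shortened, illustrative listings (assumed format, not a full export).
pyruvate_listing = (
    "cpd:C00022\tpath:map00010\n"
    "cpd:C00022\tpath:map00290\n"
    "cpd:C00022\tpath:map00770\n"
)
valine_listing = (
    "cpd:C00183\tpath:map00280\n"
    "cpd:C00183\tpath:map00290\n"
    "cpd:C00183\tpath:map00770\n"
)

print(shared_pathways(pyruvate_listing, valine_listing))
```

Only pathways that appear in both listings survive the intersection, which is why broad maps and the target pathway map00290 turn up together in the full result.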
Apparently, some of these pathways imply an undesirably high level of abstraction, while others denote metabolic pathways absent in bacteria. For further analysis I chose the ‘Valine, leucine and isoleucine biosynthesis’ pathway (ID: map00290), as it is restricted to the synthesis of L-valine and its sister branched-chain amino acids. For the sake of brevity the pathway is hereafter referred to as ‘valine biosynthesis’, and its branches leading to the biosynthesis of amino acids other than valine are ignored. The selected pathway was then analyzed for the presence of the respective enzyme-encoding genes in E. coli and B. subtilis.
These data were further summarized in the metabolic pathway flowchart. For the sake of consistency, 2-acetolactate mutase, which is absent in both species, is omitted.
Taxonomy identifiers for the suggested bacterial species were obtained from NCBI Taxonomy, as listed in the following tables:
## Species name Taxonomy ID
## 1 Escherichia coli str. K-12 substr. MG1655 511145
## 2 Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 99287
## 3 Citrobacter koseri ATCC BAA-895 290338
## 4 Yersinia pestis KIM10+ 187410
## 5 Edwardsiella tarda EIB202 498217
## 6 Erwinia amylovora ATCC 49946 716540
## 7 Proteus mirabilis HI4320 529507
## Species name Taxonomy ID
## 1 Bacillus subtilis subsp. subtilis str. 168 224308
## 2 Bacillus cereus ATCC 14579 226900
## 3 Bacillus clausii KSM-K16 66692
## 4 Bacillus halodurans C-125 272558
## 5 Bacillus licheniformis DSM 13 279010
## 6 Bacillus pumilus SAFR-032 315750
The suggested genomes were selected as queries for mapping pathway 00290 in the PATRIC database, as suggested in the task wording. In the Enterobacteriaceae batch, two species, namely Y. pestis and E. tarda, were found to be devoid of any genes involved in the pathway (data not shown); in the other species all genes were present in at least one copy, and one gene encoding acetolactate synthase (EC 2.2.1.6) was present in more than three paralogous copies in all the species. As for Bacillaceae, the gene encoding valine-pyruvate transaminase is absent in all species except B. licheniformis, while the others are present in one (e.g. ketol-acid reductoisomerase), two (3-isopropylmalate dehydratase) or a varying number of copies.
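Copy-number observations of this kind can be condensed into a phyletic pattern, that is, one presence/absence string per enzyme across the genomes. A sketch under stated assumptions: the genome labels and copy counts below are hypothetical placeholders that only mimic the shape of the pathway-mapping output, not its actual values, and the function name is illustrative:

```python
def phyletic_patterns(copies, genomes):
    """Map each EC number to a string with '1' where a genome has >= 1 copy."""
    ec_numbers = sorted({ec for counts in copies.values() for ec in counts})
    return {
        ec: "".join("1" if copies[g].get(ec, 0) > 0 else "0" for g in genomes)
        for ec in ec_numbers
    }


# Hypothetical genomes and copy counts, for illustration only.
genomes = ["genome_A", "genome_B", "genome_C"]
copies = {
    "genome_A": {"2.2.1.6": 4, "1.1.1.86": 1, "4.2.1.9": 1},
    "genome_B": {},  # a genome with no mapped genes at all
    "genome_C": {"2.2.1.6": 2, "4.2.1.9": 1},
}

for ec, pattern in phyletic_patterns(copies, genomes).items():
    print(ec, pattern)
```

Enzymes sharing the same pattern across many genomes are candidates for belonging to the same pathway, which is the idea behind the phyletic gene pattern technique named in the assignment.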
I then examined the ortholog distribution in the surveyed genomes using MicrobesOnline, as suggested in the assignment wording:
* The selected Salmonella genome was absent from the database, so a randomly picked genome of serovar Typhi was used for gene annotation.
Consistent with the previously obtained data, these results indicate that genes encoding acetolactate synthase subunits are overrepresented in bacterial genomes, presumably because of the multisubunit structure of the enzyme. They also agree on the exclusiveness of leucine dehydrogenase genes to Bacillaceae and of the alanine-synthesizing transaminase to Enterobacteriaceae. At the same time, several discrepancies were found. For instance, most of the genes indicated as absent by PATRIC were discovered during the MicrobesOnline survey. Also, a gene encoding leucine dehydrogenase is absent from the reference B. subtilis genome, though it is found in other strains of the same species as well as in several of the selected Bacillaceae members. Beyond that, MicrobesOnline reports the domain structure of the encoded proteins. For the sake of consistency, only genes from E. coli and B. subtilis were checked for their domain content.
## Gene name Found pfams
## 1 ilvH PF01842,PF10369
## 2 ilvN PF01842
## 3 ilvB PF02776,PF00205,PF02775
## 4 ilvI PF02776,PF00205,PF02775
## 5 ilvC PF07991,PF01450,PF01450
## 6 yagF PF00920
## 7 ilvD PF00920
## 8 ilvE PF01063
## 9 avtA PF00155
## Gene name Found pfams
## 1 ilvH PF01842,PF10369
## 2 ilvB PF02776,PF00205,PF02775
## 3 alsS PF02776,PF00205,PF02775
## 4 ilvC PF02826,PF07991,PF01450
## 5 ilvD PF00920
## 6 ybgE PF01063
## 7 ywaA PF01063
## 8 bcd* PF02812,PF00208
* Since the bcd gene was absent in the reference strain, information on the respective protein structure was obtained by proxy from Bacillus subtilis subsp. subtilis str. NCIB 3610.
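The two Pfam tables can be compared programmatically by grouping genes that carry the same set of domains. The sketch below copies the Pfam accessions from the tables above, assuming the first table refers to E. coli and the second to B. subtilis; the function names are illustrative, and domain order and repeats are deliberately ignored:

```python
from collections import defaultdict

ECOLI = {
    "ilvH": ["PF01842", "PF10369"],
    "ilvN": ["PF01842"],
    "ilvB": ["PF02776", "PF00205", "PF02775"],
    "ilvI": ["PF02776", "PF00205", "PF02775"],
    "ilvC": ["PF07991", "PF01450", "PF01450"],
    "yagF": ["PF00920"],
    "ilvD": ["PF00920"],
    "ilvE": ["PF01063"],
    "avtA": ["PF00155"],
}
BSUBTILIS = {
    "ilvH": ["PF01842", "PF10369"],
    "ilvB": ["PF02776", "PF00205", "PF02775"],
    "alsS": ["PF02776", "PF00205", "PF02775"],
    "ilvC": ["PF02826", "PF07991", "PF01450"],
    "ilvD": ["PF00920"],
    "ybgE": ["PF01063"],
    "ywaA": ["PF01063"],
    "bcd": ["PF02812", "PF00208"],
}


def by_domain_set(genes):
    """Group gene names by their (unordered, de-duplicated) Pfam set."""
    groups = defaultdict(list)
    for gene, pfams in genes.items():
        groups[frozenset(pfams)].append(gene)
    return groups


def shared_domain_sets(genes_a, genes_b):
    """Return domain sets found in both species with the genes carrying them."""
    groups_a, groups_b = by_domain_set(genes_a), by_domain_set(genes_b)
    return {
        tuple(sorted(domains)): (sorted(groups_a[domains]), sorted(groups_b[domains]))
        for domains in groups_a.keys() & groups_b.keys()
    }


for domains, (eco, bsu) in sorted(shared_domain_sets(ECOLI, BSUBTILIS).items()):
    print(",".join(domains), "| E. coli:", eco, "| B. subtilis:", bsu)
```

Under this comparison ilvC does not pair up between the two species, because the B. subtilis entry carries PF02826 in addition to the two domains shared with the E. coli entry.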
Genes mined in the previous two tasks were then analyzed for physical linkage by neighborhood. The suggested procedures were carried out to visualize juxtaposed genes for each entry in each genome. The results are presented in the following table. Note that neighboring genes are joined by parentheses.
* genes not identified in the primary search but observed on manual proximity investigation
Apparently, in all the genomes at least one proximity group comprising core metabolic genes is preserved. In some cases, interspersed genes for acetolactate synthase subunits nucleate additional gene islands, which may fall under the same regulon in a manner similar to that of the core enzyme genes.
For the final task I chose the E. coli gene ilvD, encoding dihydroxy-acid dehydratase (EC 4.2.1.9). Interaction network reconstruction with default settings resulted in a ten-node primary shell with moderate connectivity:
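Proximity groups of this kind can be derived from gene coordinates by chaining genes whose intergenic gap stays below a threshold. A minimal sketch, assuming all genes lie on one replicon; the gene names, coordinates and the 300 bp cut-off are hypothetical and were not used in the survey above, which was done by visual inspection:

```python
def proximity_groups(genes, max_gap=300):
    """Group (name, start, end) genes whose gaps do not exceed max_gap bp."""
    groups, current, prev_end = [], [], None
    for name, start, end in sorted(genes, key=lambda g: g[1]):
        if prev_end is not None and start - prev_end > max_gap:
            # Gap too large: close the current group and start a new one.
            groups.append(current)
            current = []
        current.append(name)
        # Track the furthest end seen so nested genes do not shrink the group.
        prev_end = end if prev_end is None else max(prev_end, end)
    if current:
        groups.append(current)
    return groups


# Hypothetical coordinates, for illustration only.
genes = [
    ("geneA", 100, 1000),
    ("geneB", 1050, 2000),
    ("geneC", 9000, 9900),
    ("geneD", 2100, 3000),
]

# Print in the report's notation: neighboring genes joined by parentheses.
print(" ".join(
    "(" + " ".join(group) + ")" if len(group) > 1 else group[0]
    for group in proximity_groups(genes)
))
```

A stricter definition would also require the same strand or the absence of intervening terminators; the gap-only rule is chosen here purely for brevity.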
Network overview
Then, each of four parameters was in turn left as the sole source of interaction evidence:
Co-expression
Neighborhood
Gene Fusion
Co-occurrence
It is clear that the co-occurrence network is the largest as well as the densest of the four evidence-wise networks, while the fusion network, which indicates reading frame union events, is the smallest, with only two vertices and one edge. The most rewarding part of this piece of analysis is the concordance between the neighborhood network and the previous proximity survey, which supports the idea that the core enzymes of valine biosynthesis are most likely to be grouped together within the genome.
Despite the lack of convenient APIs, the databases used offer great opportunities for comparative genomics. In the case of valine synthesis from the pyruvate precursor, one can not only trace the metabolic pathway in different bacteria but also dissect similarities in gene content, spatial occurrence and regulation mechanisms among these species.
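The size and density comparison between evidence-wise networks can be made explicit by filtering an edge list by evidence channel. The sketch below assumes each edge is annotated with the set of channels supporting it; the partner names and the edges themselves are hypothetical and do not reproduce the networks obtained here:

```python
def channel_summary(edges, channel):
    """Return (nodes, edges, density) of the subnetwork backed by one channel."""
    kept = {frozenset((a, b)) for a, b, channels in edges if channel in channels}
    nodes = set().union(*kept)
    n, m = len(nodes), len(kept)
    # Density of an undirected simple graph: edges over possible node pairs.
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return n, m, density


# Hypothetical edge list around ilvD, for illustration only.
edges = [
    ("ilvD", "partner1", {"neighborhood", "cooccurrence"}),
    ("ilvD", "partner2", {"cooccurrence"}),
    ("partner1", "partner2", {"cooccurrence", "fusion"}),
    ("partner2", "partner3", {"cooccurrence"}),
    ("ilvD", "partner3", {"coexpression"}),
]

for channel in ("coexpression", "neighborhood", "fusion", "cooccurrence"):
    print(channel, channel_summary(edges, channel))
```

Note that very small networks, such as a single fusion edge, trivially reach a density of 1, so density is only informative when compared alongside node and edge counts.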
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=ru_RU.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=ru_RU.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=ru_RU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_1.5 dplyr_0.8.5
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 crayon_1.3.4 digest_0.6.25 assertthat_0.2.1
## [5] R6_2.4.1 evaluate_0.14 pillar_1.4.3 rlang_0.4.5
## [9] stringi_1.4.6 rmarkdown_2.1 tools_3.6.3 stringr_1.4.0
## [13] glue_1.3.1 purrr_0.3.3 xfun_0.12 yaml_2.2.1
## [17] compiler_3.6.3 pkgconfig_2.0.3 htmltools_0.4.0 tidyselect_1.0.0
## [21] knitr_1.28 tibble_2.1.3